A survey on bias in machine learning research
Current research on bias in machine learning often focuses on fairness, while
overlooking the roots or causes of bias. However, bias was originally defined
as a "systematic error," often caused by humans at different stages of the
research process. This article aims to bridge the gap with past literature
on bias in research by providing a taxonomy of potential sources of bias and
errors in data and models. The paper focuses on bias in machine learning
pipelines. The survey analyses over forty potential sources of bias in the machine
learning (ML) pipeline, providing clear examples for each. By understanding the
sources and consequences of bias in machine learning, better methods can be
developed for detecting and mitigating it, leading to fairer, more
transparent, and more accurate ML models.
Comment: Submitted to journal. arXiv admin note: substantial text overlap with
arXiv:2308.0946
Targeted Data Augmentation for bias mitigation
The development of fair and ethical AI systems requires careful consideration
of bias mitigation, an area often overlooked or ignored. In this study, we
introduce a novel and efficient approach for addressing biases called Targeted
Data Augmentation (TDA), which leverages classical data augmentation techniques
to tackle the pressing issue of bias in data and models. Unlike the laborious
task of removing biases, our method proposes to insert biases instead,
resulting in improved performance. To identify biases, we annotated two diverse
datasets: a dataset of clinical skin lesions and a dataset of male and female
faces. These bias annotations are published for the first time in this study,
providing a valuable resource for future research. Through Counterfactual Bias
Insertion, we discovered that biases associated with the frame, ruler, and
glasses had a significant impact on models. By randomly introducing biases
during training, we mitigated these biases and achieved a substantial decrease
in bias measures, ranging from two-fold to more than 50-fold, with only
a negligible increase in the error rate.
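
The two ideas in this abstract, inserting a bias artifact at random during training and checking how much an inserted artifact changes a model's predictions, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the constant-valued border standing in for a "frame", the insertion probability p, and the "switched fraction" measure are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def insert_frame(image, width=4, value=0.0):
    """Return a copy of an (H, W) or (H, W, C) image with a constant border.

    The border is a hypothetical stand-in for a 'frame' artifact.
    """
    out = image.copy()
    if width <= 0:
        return out
    out[:width, ...] = value      # top rows
    out[-width:, ...] = value     # bottom rows
    out[:, :width, ...] = value   # left columns
    out[:, -width:, ...] = value  # right columns
    return out


def targeted_augment(batch, rng, p=0.5, width=4):
    """Insert the frame into each image independently with probability p.

    Returns the augmented batch and a boolean mask of modified images.
    Labels are left untouched, so the artifact carries no label signal.
    """
    mask = rng.random(len(batch)) < p
    out = batch.copy()
    for i in np.flatnonzero(mask):
        out[i] = insert_frame(batch[i], width)
    return out, mask


def switched_fraction(predict, batch, width=4):
    """Fraction of images whose predicted class changes after insertion.

    `predict` is any callable mapping a batch to an array of class labels.
    This is an assumed, simplified bias measure, not the paper's metric.
    """
    before = predict(batch)
    after = predict(np.stack([insert_frame(x, width) for x in batch]))
    return float(np.mean(before != after))
```

In a training loop, `targeted_augment` would be applied to each batch before the forward pass, and `switched_fraction` could be evaluated on held-out data before and after training to see whether the model has become less sensitive to the artifact.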